This report explores the possibilities for identifying high quality red wines based on their chemical properties. Wine making and buying are often more art than science. Selecting a good wine for dinner from dozens of options is challenging for customers. Production also relies heavily on intuition and tradition. Although a tradition may well be good and correct, developing one is difficult and can take decades. An easier and more robust way to assess wine quality, and to spot problems in a wine, would be beneficial.
As a first step towards a solution, I will explore the wine quality dataset collected by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis (Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236). The dataset consists of measurements of various chemical properties of Portuguese vinho verde red wines and subjective quality assessment scores based on wine expert evaluations on a scale from 0 (very bad) to 10 (very excellent). My goal is to explore whether it is possible to identify good wines based only on the measurements of chemical properties. To do so, I first plan to explore the distributions of the different features and their correlations with quality scores. If promising features for classification are identified, I’d like to try fitting a few ‘quick and dirty’ classification models to the dataset as a proof of concept for a recommendation system.
More information on the dataset is available at: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
The dataset consists of 1599 observations of 12 variables, plus an index column. Variable X appears to be just a row id. Quality is recorded as an integer, and the observed values range from 3 to 8. The full scale runs from 0 to 10, but for some reason the extreme values have not been used. The other variables are continuous measures of physicochemical properties of the wine.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
##
## 3 4 5 6 7 8
## 0.6 3.3 42.6 39.9 12.4 1.1
The distribution of wine quality scores is roughly bell-shaped, with median 6 and mean 5.636. The left tail appears longer, but the right tail is heavier. It might make sense to combine categories, as some of them have only a few observations.
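These summary figures can be checked from the frequency table alone. The sketch below rebuilds the quality vector from the printed counts and recomputes the percentages, the mean and the median. The variable names are illustrative and not taken from the original analysis.

```r
# Quality counts copied from the frequency table above (scores 3 to 8).
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Expand the counts back into one score per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Share of each score in percent, rounded to one decimal.
quality_pct <- round(100 * prop.table(table(quality)), 1)

quality_mean <- mean(quality)
quality_median <- median(quality)

print(quality_pct)
print(c(mean = quality_mean, median = quality_median))
```

Working from the counts rather than the raw file keeps the check independent of how the data was loaded.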
##
## Low High
## 1382 217
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 5.000 5.409 6.000 6.000
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.000 7.000 7.000 7.083 7.000 8.000
It might be easier to work with only two categories of wines instead of the full range of evaluations. The buyers are likely more interested in whether a wine is worth buying or not instead of exact ratings. Mean quality score for low quality wines is 5.4 and for high quality wines the mean is 7.1.
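The report does not show how the two categories were built. One possible way, sketched here under the assumption that scores up to 6 count as Low and scores of 7 or more as High, is to use `cut()`. The quality vector is rebuilt from the counts printed earlier, and the names `quality_bin`, `bin_counts` and `bin_means` are illustrative.

```r
# Rebuild the quality scores from the counts shown earlier in the report.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Scores up to and including 6 become "Low"; 7 and above become "High".
quality_bin <- cut(quality, breaks = c(-Inf, 6, Inf), labels = c("Low", "High"))

# Group sizes and mean quality score per group.
bin_counts <- table(quality_bin)
bin_means <- tapply(quality, quality_bin, mean)

print(bin_counts)
print(round(bin_means, 3))
```

With these breaks the group sizes and group means agree with the output above (1382 and 217 wines, means of about 5.4 and 7.1).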
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity of the wines is concentrated around the value 8, with some skew to the right. Most wines have a fixed acidity between 7 and 9.5. It will be interesting to see whether the best wines have the highest acidity. In that case they would be easy to identify.
wine[wine$fixed.acidity > 15, ]
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 443 443 15.6 0.685 0.76 3.7
## 555 555 15.5 0.645 0.49 4.2
## 556 556 15.5 0.645 0.49 4.2
## 558 558 15.6 0.645 0.49 4.2
## 653 653 15.9 0.360 0.65 7.5
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 443 0.100 6 43 1.00320 2.95
## 555 0.095 10 23 1.00315 2.92
## 556 0.095 10 23 1.00315 2.92
## 558 0.095 10 23 1.00315 2.92
## 653 0.096 22 71 0.99760 2.98
## sulphates alcohol quality quality.bin
## 443 0.68 11.2 7 High
## 555 0.74 11.1 5 Low
## 556 0.74 11.1 5 Low
## 558 0.74 11.1 5 Low
## 653 0.84 14.9 5 Low
There are a few outliers. Weird, it looks like three of them could be repeated measurements of the same wine: rows 555 and 556 are identical, and row 558 differs only in fixed acidity.
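Exact repeats like these can be flagged programmatically. The sketch below uses base R's `duplicated()` on the outlier rows printed above, with the id column left out of the comparison. Only a subset of the columns is reproduced, and the data frame name `outliers` is illustrative.

```r
# Rows with fixed.acidity > 15, copied from the output above
# (only a subset of the columns is reproduced here).
outliers <- data.frame(
  X                = c(443, 555, 556, 558, 653),
  fixed.acidity    = c(15.6, 15.5, 15.5, 15.6, 15.9),
  volatile.acidity = c(0.685, 0.645, 0.645, 0.645, 0.360),
  citric.acid      = c(0.76, 0.49, 0.49, 0.49, 0.65),
  residual.sugar   = c(3.7, 4.2, 4.2, 4.2, 7.5),
  alcohol          = c(11.2, 11.1, 11.1, 11.1, 14.9),
  quality          = c(7, 5, 5, 5, 5)
)

# Ignore the id column when comparing rows.
measurements <- outliers[, setdiff(names(outliers), "X")]

# TRUE marks a row that repeats an earlier row exactly.
is_repeat <- duplicated(measurements)
repeated_ids <- outliers$X[is_repeat]

print(repeated_ids)
```

Note that `duplicated()` only catches exact matches, so a near-duplicate such as row 558 would need a separate comparison with a tolerance.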
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1300 1300 7.6 1.58 0 2.1
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1300 0.137 5 9 0.99476 3.5
## sulphates alcohol quality quality.bin
## 1300 0.4 10.9 3 Low
Volatile acidity is much lower than fixed acidity in absolute terms: mean volatile acidity is 0.528, compared to a mean of 8.32 for fixed acidity, which is an order of magnitude higher. The distribution appears to have low variance, with a few outliers to the right. The biggest outlier is of poor quality. Someone got wine-making really wrong?
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
Increasing the resolution reveals an interesting chasm in the middle of the distribution. Why is this?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Many wines have zero or very little citric acid. Otherwise the distribution is quite flat until it starts to decrease around 0.5. There is a curious spike at this value, and a couple of less distinct ones at lower values. It looks as if the wine makers might be aiming for a citric acid level of either zero, 0.25 or 0.5. Maybe these spikes indicate different types of wines?
##
## 0 0.49 0.24 0.02 0.26 0.1 0.01 0.08 0.21 0.32 0.03 0.09 0.3 0.31 0.04
## 132 68 51 50 38 35 33 33 33 32 30 30 30 30 29
## 0.4 0.42 0.39 0.12 0.22 0.25 0.2 0.23 0.33 0.06 0.34 0.44 0.48 0.07 0.18
## 29 29 28 27 27 27 25 25 25 24 24 23 23 22 22
## 0.45 0.14 0.19 0.29 0.05 0.27 0.36 0.5 0.15 0.28 0.37 0.46 0.13 0.47 0.52
## 22 21 21 21 20 20 20 20 19 19 19 19 18 18 17
## 0.17 0.41 0.11 0.43 0.38 0.53 0.66 0.35 0.51 0.54 0.55 0.68 0.63 0.16 0.57
## 16 16 15 15 14 14 14 13 13 13 12 11 10 9 9
## 0.58 0.6 0.64 0.56 0.59 0.65 0.69 0.74 0.73 0.76 0.61 0.67 0.7 0.62 0.71
## 9 9 9 8 8 7 4 4 3 3 2 2 2 1 1
## 0.72 0.75 0.78 0.79 1
## 1 1 1 1 1
Actually the exact locations of the spikes are 0.24 and 0.49.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 34 34 6.9 0.605 0.12 10.7
## 325 325 10.0 0.490 0.20 11.0
## 326 326 10.0 0.490 0.20 11.0
## 481 481 10.6 0.280 0.39 15.5
## 1236 1236 6.0 0.330 0.32 12.9
## 1245 1245 5.9 0.290 0.25 13.4
## 1435 1435 10.2 0.540 0.37 15.4
## 1436 1436 10.2 0.540 0.37 15.4
## 1475 1475 9.9 0.500 0.50 13.8
## 1477 1477 9.9 0.500 0.50 13.8
## 1575 1575 5.6 0.310 0.78 13.9
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 34 0.073 40 83 0.99930 3.45
## 325 0.071 13 50 1.00150 3.16
## 326 0.071 13 50 1.00150 3.16
## 481 0.069 6 23 1.00260 3.12
## 1236 0.054 6 113 0.99572 3.30
## 1245 0.067 72 160 0.99721 3.33
## 1435 0.214 55 95 1.00369 3.18
## 1436 0.214 55 95 1.00369 3.18
## 1475 0.205 48 82 1.00242 3.16
## 1477 0.205 48 82 1.00242 3.16
## 1575 0.074 23 92 0.99677 3.39
## sulphates alcohol quality quality.bin
## 34 0.52 9.4 6 Low
## 325 0.69 9.2 6 Low
## 326 0.69 9.2 6 Low
## 481 0.66 9.2 5 Low
## 1236 0.56 11.5 4 Low
## 1245 0.54 10.3 6 Low
## 1435 0.77 9.0 6 Low
## 1436 0.77 9.0 6 Low
## 1475 0.75 8.8 5 Low
## 1477 0.75 8.8 5 Low
## 1575 0.48 10.5 6 Low
Most wines have a low amount of residual sugar, between about 1 and 3. Some have much higher amounts: the highest outlier has about 6 times as much residual sugar as the average wine. Perhaps they are of a different type, like dessert wines? None of the outliers are high quality wines.
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 18 18 8.1 0.560 0.28 1.7
## 20 20 7.9 0.320 0.51 1.8
## 43 43 7.5 0.490 0.20 2.6
## 82 82 7.8 0.430 0.70 1.9
## 84 84 7.3 0.670 0.26 1.8
## 107 107 7.8 0.410 0.68 1.7
## 152 152 9.2 0.520 1.00 3.4
## 170 170 7.5 0.705 0.24 1.8
## 227 227 8.9 0.590 0.50 2.0
## 259 259 7.7 0.410 0.76 1.8
## 282 282 7.7 0.270 0.68 3.5
## 292 292 11.0 0.200 0.48 2.0
## 452 452 8.4 0.370 0.53 1.8
## 693 693 8.6 0.490 0.51 2.0
## 731 731 9.5 0.550 0.66 2.3
## 755 755 7.8 0.480 0.68 1.7
## 1052 1052 8.5 0.460 0.59 1.4
## 1166 1166 8.5 0.440 0.50 1.9
## 1261 1261 8.6 0.635 0.68 1.8
## 1320 1320 9.1 0.760 0.68 1.7
## 1371 1371 8.7 0.780 0.51 1.7
## 1373 1373 8.7 0.780 0.51 1.7
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 18 0.368 16 56 0.99680 3.11
## 20 0.341 17 56 0.99690 3.04
## 43 0.332 8 14 0.99680 3.21
## 82 0.464 22 67 0.99740 3.13
## 84 0.401 16 51 0.99690 3.16
## 107 0.467 18 69 0.99730 3.08
## 152 0.610 32 69 0.99960 2.74
## 170 0.360 15 63 0.99640 3.00
## 227 0.337 27 81 0.99640 3.04
## 259 0.611 8 45 0.99680 3.06
## 282 0.358 5 10 0.99720 3.25
## 292 0.343 6 18 0.99790 3.30
## 452 0.413 9 26 0.99790 3.06
## 693 0.422 16 62 0.99790 3.03
## 731 0.387 12 37 0.99820 3.17
## 755 0.415 14 32 0.99656 3.09
## 1052 0.414 16 45 0.99702 3.03
## 1166 0.369 15 38 0.99634 3.01
## 1261 0.403 19 56 0.99632 3.02
## 1320 0.414 18 64 0.99652 2.90
## 1371 0.415 12 66 0.99623 3.00
## 1373 0.415 12 66 0.99623 3.00
## sulphates alcohol quality quality.bin
## 18 1.28 9.3 5 Low
## 20 1.08 9.2 6 Low
## 43 0.90 10.5 6 Low
## 82 1.28 9.4 5 Low
## 84 1.14 9.4 5 Low
## 107 1.31 9.3 5 Low
## 152 2.00 9.4 4 Low
## 170 1.59 9.5 5 Low
## 227 1.61 9.5 6 Low
## 259 1.26 9.4 5 Low
## 282 1.08 9.9 7 High
## 292 0.71 10.5 5 Low
## 452 1.06 9.1 6 Low
## 693 1.17 9.0 5 Low
## 731 0.67 9.6 5 Low
## 755 1.06 9.1 6 Low
## 1052 1.34 9.2 5 Low
## 1166 1.10 9.4 5 Low
## 1261 1.15 9.3 5 Low
## 1320 1.33 9.1 6 Low
## 1371 1.17 9.2 5 Low
## 1373 1.17 9.2 5 Low
The distribution of chlorides resembles that of residual sugar. Most values are tightly concentrated around 0.08, with a thin and long right tail all the way to 0.6. It looks as if there is a small concentration of wines around 0.4. Is this a distinct subtype or category of wines, or just an artefact in the data? With one exception the outliers are low quality wines. They are all dry (low residual sugar) and low on alcohol.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Most wines have lowish amounts of free sulfur dioxide. The distribution is again right-skewed. In absolute terms the differences are large, from 1 mg/dm^3 to 72 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Same story here as with free sulfur dioxide, but with values roughly three times higher on average (mean 46.5 versus 15.9). I wonder what the relationship between the free and total amounts of sulfur dioxide is?
There is a slightly increasing trend in additional sulfur dioxide as the amount of free sulfur dioxide increases. It is still quite common for most of the total sulfur dioxide to be accounted for by free sulfur dioxide. I’m creating a new variable fixed.sulfur.dioxide by calculating the difference between the two measures.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 21.00 30.59 39.00 251.50
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1080 1080 7.9 0.3 0.68 8.3
## 1082 1082 7.9 0.3 0.68 8.3
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1080 0.05 37.5 278 0.99316 3.01
## 1082 0.05 37.5 289 0.99316 3.01
## sulphates alcohol quality quality.bin fixed.sulfur.dioxide
## 1080 0.51 12.3 7 High 240.5
## 1082 0.51 12.3 7 High 251.5
Similar distribution as with total sulfur dioxide. There are two extreme outliers, both of them high quality. Features of these two wines are curiously similar: the only difference is in amount of total sulfur dioxide. Duplicate?
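The derived variable is a plain column difference. As a minimal sketch, the snippet below applies it to the first six wines and to rows 1080 and 1082, using the free and total values printed in this report. The data frame name `so2` and the extra `free.share` column are illustrative additions, not part of the original analysis.

```r
# Free and total sulfur dioxide for the first six wines and for
# rows 1080 and 1082, copied from the outputs in this report.
so2 <- data.frame(
  X                    = c(1, 2, 3, 4, 5, 6, 1080, 1082),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 37.5, 37.5),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 278, 289)
)

# The new variable is whatever part of the total is not free.
so2$fixed.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

# Share of the total that is free (an extra column for illustration only).
so2$free.share <- so2$free.sulfur.dioxide / so2$total.sulfur.dioxide

print(so2)
```

For rows 1080 and 1082 this reproduces the values 240.5 and 251.5 shown in the output above.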
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density is almost normally distributed around a value a little less than 1, which makes sense as wine is mostly water, and alcohol is less dense than water. Density might actually be correlated with the amount of alcohol.
There indeed is a downward trend with increasing alcohol levels. Stronger wines tend to be less dense.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 46 46 4.6 0.52 0.15 2.1
## 96 96 4.7 0.60 0.17 2.3
## 152 152 9.2 0.52 1.00 3.4
## 696 696 5.1 0.47 0.02 1.3
## 1317 1317 5.4 0.74 0.00 1.2
## 1322 1322 5.0 0.74 0.00 1.2
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 46 0.054 8 65 0.99340 3.90
## 96 0.058 17 106 0.99320 3.85
## 152 0.610 32 69 0.99960 2.74
## 696 0.034 18 44 0.99210 3.90
## 1317 0.041 16 46 0.99258 4.01
## 1322 0.041 16 46 0.99258 4.01
## sulphates alcohol quality quality.bin fixed.sulfur.dioxide
## 46 0.56 13.1 4 Low 57
## 96 0.60 12.9 6 Low 89
## 152 2.00 9.4 4 Low 37
## 696 0.62 12.8 6 Low 26
## 1317 0.59 12.5 6 Low 30
## 1322 0.59 12.5 6 Low 30
pH of the wines is almost normally distributed around the mean pH 3.3. Wines are acidic. The few minor outliers are not interesting.
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 14 14 7.8 0.610 0.29 1.6
## 87 87 8.6 0.490 0.28 1.9
## 92 92 8.6 0.490 0.28 1.9
## 93 93 8.6 0.490 0.29 2.0
## 152 152 9.2 0.520 1.00 3.4
## 170 170 7.5 0.705 0.24 1.8
## 227 227 8.9 0.590 0.50 2.0
## 724 724 7.1 0.310 0.30 2.2
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 14 0.114 9 29 0.9974 3.26
## 87 0.110 20 136 0.9972 2.93
## 92 0.110 20 136 0.9972 2.93
## 93 0.110 19 133 0.9972 2.93
## 152 0.610 32 69 0.9996 2.74
## 170 0.360 15 63 0.9964 3.00
## 227 0.337 27 81 0.9964 3.04
## 724 0.053 36 127 0.9965 2.94
## sulphates alcohol quality quality.bin fixed.sulfur.dioxide
## 14 1.56 9.1 5 Low 20
## 87 1.95 9.9 6 Low 116
## 92 1.95 9.9 6 Low 116
## 93 1.98 9.8 5 Low 114
## 152 2.00 9.4 4 Low 37
## 170 1.59 9.5 5 Low 48
## 227 1.61 9.5 6 Low 54
## 724 1.62 9.5 5 Low 91
A relatively tight distribution with some skew and outliers to the right. Typical wines have sulphates between 0.5 and 0.8. All outliers are low quality wines low on residual sugar and alcohol.
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
##
## 9.5 9.4 9.8 9.2 10 10.5 9.3 9.6 11 9.7 9.9 10.9
## 139 103 78 72 67 67 59 59 59 54 49 49
## 10.1 10.2 10.8 10.4 11.2 10.3 11.3 11.4 9 11.5 11.8 10.6
## 47 46 42 41 36 33 32 32 30 30 29 28
## 10.7 11.1 9.1 11.7 12 12.5 11.9 12.8 11.6 12.1 12.4 12.2
## 27 27 23 23 21 21 20 17 15 13 13 12
## 12.3 12.7 12.9 14 12.6 13 13.6 13.3 13.4 8.4 8.7 8.8
## 12 9 9 7 6 6 4 3 3 2 2 2
## 9.55 10.03 10.55 13.1 8.5 9.05 9.23 9.25 9.57 9.95 10.75 11.07
## 2 2 2 2 1 1 1 1 1 1 1 1
## 11.95 13.2 13.5 13.57 14.9
## 1 1 1 1 1
Wines typically have at least 9 % alcohol, with around 10 % being the average, and the number of wines slowly decreases as the alcohol content increases. Wine makers seem to prefer round numbers in alcohol content: there are spikes in the distribution around every .0 and .5.
The red wine dataset consists of 1599 observations of 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). Quality is an ordered categorical variable with observed values from 3 to 8, larger values being better. The other variables are continuous.
Most wines (82.5 %) have a quality rating of 5 or 6. 7 is the third most common rating (12.4 %) while all the other quality scores cover only 5 % of the wines. Red wine is acidic (pH 2.7-4.0) and usually has only little residual sugar. Mean alcohol content of wines is 10.4 %.
The main feature of interest is quality. I’d like to be able to classify wines to high (quality 7 or 8) and low quality (quality 6 or lower) categories based on some combination of physical measures.
Based on the shapes of the distributions, volatile acidity, citric acid, chlorides and alcohol seem promising. In particular, the alcohol and citric acid distributions feature curious spikes at round values, suggesting that wine makers may be targeting specific values of these features, which in turn implies they believe these features have something to do with the quality of the wine.
I created the variable fixed.sulfur.dioxide by subtracting free sulfur dioxide from total sulfur dioxide. I also combined quality categories into a new binary variable quality.bin. In this variable ‘High’ is assigned to wines with quality 7 or 8 and ‘Low’ is assigned to all other wines.
Several of the distributions were right-skewed. I tried a few transformations (logarithmic, cube root, power) on some of them, but the shapes of the distributions did not improve. In the end I used the features as they are. (With hindsight, classification models could have benefited from normalization.)
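For reference, the transformations and the normalization mentioned above can be sketched in base R as follows. The input is a handful of residual sugar values taken from this report (the first six wines plus the largest outlier), so it only illustrates the mechanics, not the effect on the full distribution. The variable names are illustrative.

```r
# A few residual sugar values from this report: the first six wines
# followed by the largest outlier (15.5).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 15.5)

# Candidate transformations for a right-skewed feature.
log_sugar  <- log10(residual_sugar)
cbrt_sugar <- residual_sugar^(1 / 3)

# Standardisation (z-scores): subtract the mean, divide by the standard deviation.
z_sugar <- as.numeric(scale(residual_sugar))

print(round(cbind(residual_sugar, log_sugar, cbrt_sugar, z_sugar), 3))
```

Standardisation does not change the shape of a distribution, but it puts features on a common scale, which matters for distance-based or regularised classifiers.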
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## fixed.sulfur.dioxide -0.07814929 0.097033939 0.06677604
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## fixed.sulfur.dioxide 0.174529035 0.055479649 0.425148917
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## fixed.sulfur.dioxide 0.95768634 0.09513464 -0.10805328
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
## fixed.sulfur.dioxide 0.032244043 -0.22320257 -0.20546298
## fixed.sulfur.dioxide
## fixed.acidity -0.07814929
## volatile.acidity 0.09703394
## citric.acid 0.06677604
## residual.sugar 0.17452903
## chlorides 0.05547965
## free.sulfur.dioxide 0.42514892
## total.sulfur.dioxide 0.95768634
## density 0.09513464
## pH -0.10805328
## sulphates 0.03224404
## alcohol -0.22320257
## quality -0.20546298
## fixed.sulfur.dioxide 1.00000000
## wine[, 14]: Low
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.160 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.100 1st Qu.:0.420 1st Qu.:0.0825 1st Qu.: 1.900
## Median : 7.800 Median :0.540 Median :0.2400 Median : 2.200
## Mean : 8.237 Mean :0.547 Mean :0.2544 Mean : 2.512
## 3rd Qu.: 9.100 3rd Qu.:0.650 3rd Qu.:0.4000 3rd Qu.: 2.600
## Max. :15.900 Max. :1.580 Max. :1.0000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.03400 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07100 1st Qu.: 8.00 1st Qu.: 23.00
## Median :0.08000 Median :14.00 Median : 39.50
## Mean :0.08928 Mean :16.17 Mean : 48.29
## 3rd Qu.:0.09100 3rd Qu.:22.00 3rd Qu.: 65.00
## Max. :0.61100 Max. :72.00 Max. :165.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9958 1st Qu.:3.210 1st Qu.:0.5400 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6000 Median :10.00
## Mean :0.9969 Mean :3.315 Mean :0.6448 Mean :10.25
## 3rd Qu.:0.9979 3rd Qu.:3.410 3rd Qu.:0.7000 3rd Qu.:10.90
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## fixed.sulfur.dioxide
## Min. : 3.00
## 1st Qu.: 12.00
## Median : 23.00
## Mean : 32.11
## 3rd Qu.: 42.00
## Max. :128.00
## --------------------------------------------------------
## wine[, 14]: High
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.900 Min. :0.1200 Min. :0.0000 Min. :1.200
## 1st Qu.: 7.400 1st Qu.:0.3000 1st Qu.:0.3000 1st Qu.:2.000
## Median : 8.700 Median :0.3700 Median :0.4000 Median :2.300
## Mean : 8.847 Mean :0.4055 Mean :0.3765 Mean :2.709
## 3rd Qu.:10.100 3rd Qu.:0.4900 3rd Qu.:0.4900 3rd Qu.:2.700
## Max. :15.600 Max. :0.9150 Max. :0.7600 Max. :8.900
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 3.00 Min. : 7.00
## 1st Qu.:0.06200 1st Qu.: 6.00 1st Qu.: 17.00
## Median :0.07300 Median :11.00 Median : 27.00
## Mean :0.07591 Mean :13.98 Mean : 34.89
## 3rd Qu.:0.08500 3rd Qu.:18.00 3rd Qu.: 43.00
## Max. :0.35800 Max. :54.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9906 Min. :2.880 Min. :0.3900 Min. : 9.20
## 1st Qu.:0.9947 1st Qu.:3.200 1st Qu.:0.6500 1st Qu.:10.80
## Median :0.9957 Median :3.270 Median :0.7400 Median :11.60
## Mean :0.9960 Mean :3.289 Mean :0.7435 Mean :11.52
## 3rd Qu.:0.9973 3rd Qu.:3.380 3rd Qu.:0.8200 3rd Qu.:12.20
## Max. :1.0032 Max. :3.780 Max. :1.3600 Max. :14.00
## fixed.sulfur.dioxide
## Min. : 4.00
## 1st Qu.: 9.00
## Median : 14.00
## Mean : 20.91
## 3rd Qu.: 22.00
## Max. :251.50
Volatile acidity, citric acid, sulphates and alcohol have moderate correlations with wine quality. These features also have noticeably different means between low and high quality wines.
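To make the ranking explicit, the sketch below takes the quality column of the correlation matrix above (values rounded to three decimals) and orders the features by absolute correlation. The vector and variable names are illustrative.

```r
# Pearson correlations with quality, copied from the "quality" column
# of the correlation matrix above and rounded to three decimals.
cor_with_quality <- c(
  fixed.acidity        =  0.124,
  volatile.acidity     = -0.391,
  citric.acid          =  0.226,
  residual.sugar       =  0.014,
  chlorides            = -0.129,
  free.sulfur.dioxide  = -0.051,
  total.sulfur.dioxide = -0.185,
  density              = -0.175,
  pH                   = -0.058,
  sulphates            =  0.251,
  alcohol              =  0.476,
  fixed.sulfur.dioxide = -0.205
)

# Order by strength of the relationship, ignoring the sign.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
top_four <- names(ranked)[1:4]

print(ranked)
```

The four strongest features by this measure are alcohol, volatile acidity, sulphates and citric acid, the same four named above.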
No clear trends here. Poor and good wines seem to have higher fixed acidity, but on the other hand there are only a few data points on them, so the effect does not feel very trustworthy.
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 7.100 7.800 8.237 9.100 15.900
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.700 8.847 10.100 15.600
Comparing only two quality categories reveals that high quality wines actually tend to have higher fixed acidity. The difference between the median values is especially noticeable. Combining quality categories is starting to look like a good idea.
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.160 0.420 0.540 0.547 0.650 1.580
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
There is a clear decreasing trend with volatile acidity when the wine quality increases. High quality wines have lower values in all quartiles.
With the increasing quality the distribution of volatile acidity moves to left and gets narrower.
The lower the volatile acidity, the more likely the wine is to be of high quality. Looks like about 0.38 volatile acidity is the sweet spot for red wines.
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0825 0.2400 0.2544 0.4000 1.0000
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.4000 0.3765 0.4900 0.7600
The very best wines tend to have higher amounts of citric acid.
Interesting! On average good wines tend to have a lot of citric acid, but the density plot reveals the picture is more complex. There seem to be three kinds of wines regarding citric acid: low (close to 0), medium (~0.25) and high (~0.4) amounts of citric acid. Good wines have either a little or a lot of citric acid, while other wines can have any amount of it.
Lots of outliers.
ggplot(wine, aes(as.factor(quality), residual.sugar)) +
  geom_boxplot() +
  ylim(0, 4)
## Warning in loop_apply(n, do.ply): Removed 125 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(quality.bin, residual.sugar)) +
  geom_boxplot() +
  ylim(0, 4)
## Warning in loop_apply(n, do.ply): Removed 125 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(residual.sugar, color = quality.bin)) +
  geom_density()
by(wine$residual.sugar, wine$quality.bin, summary)
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.512 2.600 15.500
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.709 2.700 8.900
Nothing interesting going on here.
Again many outliers.
ggplot(wine, aes(as.factor(quality), chlorides)) +
  geom_boxplot() +
  ylim(0, 0.2)
## Warning in loop_apply(n, do.ply): Removed 41 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(quality.bin, chlorides)) +
  geom_boxplot() +
  ylim(0, 0.2)
## Warning in loop_apply(n, do.ply): Removed 41 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(chlorides, color = quality.bin)) +
  geom_density()
by(wine$chlorides, wine$quality.bin, summary)
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.07100 0.08000 0.08928 0.09100 0.61100
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07591 0.08500 0.35800
Good wines appear to have slightly lower amounts of chlorides overall: 0.076 vs 0.089 on average. However, there is a lot of overlap between the distributions.
Quite a few outliers.
ggplot(wine, aes(as.factor(quality), free.sulfur.dioxide)) +
  geom_boxplot() +
  ylim(0, 30)
## Warning in loop_apply(n, do.ply): Removed 163 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(quality.bin, free.sulfur.dioxide)) +
  geom_boxplot() +
  ylim(0, 30)
## Warning in loop_apply(n, do.ply): Removed 163 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(free.sulfur.dioxide, color = quality.bin)) +
  geom_density()
by(wine$free.sulfur.dioxide, wine$quality, summary)
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
Average quality wines seem to have a little more free sulfur dioxide on average, but this does not help much in differentiating high quality wines from others. Poor wines (quality score 4) and the best wines (quality score 8) have about the same amount of free sulfur dioxide on average.
Outliers.
ggplot(wine, aes(as.factor(quality), total.sulfur.dioxide)) +
  geom_boxplot() +
  ylim(0, 120)
## Warning in loop_apply(n, do.ply): Removed 62 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(quality.bin, total.sulfur.dioxide)) +
  geom_boxplot() +
  ylim(0, 120)
## Warning in loop_apply(n, do.ply): Removed 62 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(total.sulfur.dioxide, color = quality.bin)) +
  geom_density() +
  xlim(0, 120)
## Warning in loop_apply(n, do.ply): Removed 60 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 2 rows containing non-finite
## values (stat_density).
by(wine$total.sulfur.dioxide, wine$quality.bin, summary)
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 39.50 48.29 65.00 165.00
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.00 27.00 34.89 43.00 289.00
Total sulfur dioxide is a better indicator of whether a wine is good or bad. Good wines have almost 30 % less total sulfur dioxide than poor wines, on average.
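The "almost 30 %" figure can be checked directly from the group summaries. The sketch below uses the means and medians printed above as literal values; the helper name `relative_drop` is hypothetical and only serves the illustration.

```r
# Means and medians of total sulfur dioxide, copied from the summary above
low  <- c(mean = 48.29, median = 39.50)
high <- c(mean = 34.89, median = 27.00)

# Relative reduction of the High group compared with the Low group
# (relative_drop is a hypothetical helper, not part of the original analysis)
relative_drop <- function(low, high) (low - high) / low

drop_pct <- round(100 * relative_drop(low, high), 1)
print(drop_pct)
```

The mean is about 28 % lower and the median about 32 % lower in high quality wines, which is in line with the statement above.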
## Warning in loop_apply(n, do.ply): Removed 2 rows containing missing values
## (geom_point).
High quality wines seem to lie along a line where the amount of total sulfur dioxide is low compared to free sulfur dioxide.
## Warning in loop_apply(n, do.ply): Removed 163 rows containing non-finite
## values (stat_boxplot).
## Warning in loop_apply(n, do.ply): Removed 62 rows containing non-finite
## values (stat_boxplot).
## Warning in loop_apply(n, do.ply): Removed 10 rows containing non-finite
## values (stat_boxplot).
The pattern is not very clear, though. There is a lot of overlap between the distributions.
## Warning in loop_apply(n, do.ply): Removed 2 rows containing missing values
## (geom_point).
Getting better…
## Warning in loop_apply(n, do.ply): Removed 10 rows containing non-finite
## values (stat_boxplot).
## Warning in loop_apply(n, do.ply): Removed 135 rows containing non-finite
## values (stat_boxplot).
## Warning in loop_apply(n, do.ply): Removed 8 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 2 rows containing non-finite
## values (stat_density).
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 23.00 32.11 42.00 128.00
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 9.00 14.00 20.91 22.00 251.50
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1080 1080 7.9 0.3 0.68 8.3
## 1082 1082 7.9 0.3 0.68 8.3
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1080 0.05 37.5 278 0.99316 3.01
## 1082 0.05 37.5 289 0.99316 3.01
## sulphates alcohol quality quality.bin fixed.sulfur.dioxide
## 1080 0.51 12.3 7 High 240.5
## 1082 0.51 12.3 7 High 251.5
Fixed sulfur dioxide is an even better discriminator than total sulfur dioxide! Good wines have low amounts of fixed sulfur dioxide: the quartiles, median and mean are roughly 25 to 50 % lower than in poor wines. There are two extreme outliers that seem to be related or duplicated wines. The only difference between them is the amount of total sulfur dioxide, and therefore also fixed sulfur dioxide.
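The two printed rows are consistent with fixed sulfur dioxide being total minus free sulfur dioxide (278 - 37.5 = 240.5). Assuming that definition, the sketch below rebuilds the derived column for the two outliers and checks whether they match on the other columns. Only a subset of the columns is reproduced, with values copied from the output above; the names `outliers`, `compare_cols` and `same_wine` are hypothetical.

```r
# The two outlier rows, values copied from the printed output above
# (only a subset of the columns is reproduced here)
outliers <- data.frame(
  X = c(1080, 1082),
  fixed.acidity = c(7.9, 7.9),
  free.sulfur.dioxide = c(37.5, 37.5),
  total.sulfur.dioxide = c(278, 289),
  alcohol = c(12.3, 12.3),
  quality = c(7, 7)
)

# Assumed definition: fixed SO2 = total SO2 - free SO2
outliers$fixed.sulfur.dioxide <-
  outliers$total.sulfur.dioxide - outliers$free.sulfur.dioxide

# Columns that should agree if the rows describe the same wine
# (row id and the two sulfur dioxide totals are excluded)
compare_cols <- setdiff(
  names(outliers),
  c("X", "total.sulfur.dioxide", "fixed.sulfur.dioxide")
)
same_wine <- duplicated(outliers[, compare_cols])

print(outliers$fixed.sulfur.dioxide)
print(same_wine)
```

The second row is flagged as a duplicate of the first once the sulfur dioxide totals are ignored, which supports treating the two wines as related measurements.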
## Warning in loop_apply(n, do.ply): Removed 1 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 34 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 5 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 2 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 40 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 2 rows containing non-finite
## values (stat_density).
Looks like a low amount of fixed sulfur dioxide is a prerequisite but not a guarantee for high wine quality.
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9958 0.9968 0.9969 0.9979 1.0040
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9947 0.9957 0.9960 0.9974 1.0030
Higher quality wines seem to have lower density. They also have more alcohol, which could cause the correlation. It is probably a good idea to explore how different things affect the density of the wine.
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.315 3.410 4.010
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.270 3.289 3.380 3.780
Better wines tend to have slightly lower pH, perhaps because they often have higher fixed acidity. The pattern is not very clear, though.
Outliers ruining a plot again.
ggplot(wine, aes(as.factor(quality), sulphates)) +
geom_boxplot() +
ylim(0, 1)
## Warning in loop_apply(n, do.ply): Removed 58 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(quality.bin, sulphates)) +
geom_boxplot() +
ylim(0, 1)
## Warning in loop_apply(n, do.ply): Removed 58 rows containing non-finite
## values (stat_boxplot).
ggplot(wine, aes(sulphates, color = as.factor(quality))) +
geom_density() +
xlim(0, 1.5)
## Warning in loop_apply(n, do.ply): Removed 1 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 4 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 3 rows containing non-finite
## values (stat_density).
ggplot(wine, aes(sulphates, color = quality.bin)) +
geom_density() +
xlim(0, 1.5)
## Warning in loop_apply(n, do.ply): Removed 8 rows containing non-finite
## values (stat_density).
by(wine$sulphates, wine$quality.bin, summary)
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5400 0.6000 0.6448 0.7000 2.0000
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7435 0.8200 1.3600
Higher amounts of sulphates are associated with higher quality, but there are many outliers in average quality wines that muddy the relationship.
The pattern with alcohol is a little bit U-shaped. The worst quality wines tend to have more alcohol than average wines, and then the better than average wines have increasing amounts of alcohol.
With lower resolution the pattern becomes clearer. Better wines tend to have higher amounts of alcohol.
## wine$quality.bin: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## wine$quality.bin: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
Wines with more than 12 % alcohol are likely to have high quality, and wines with less than 10 % alcohol are likely poor.
Many of the features are associated with density, which makes sense. Fixed acidity and alcohol seem to have the strongest association. pH has strong correlations with acidity measures, so its association with density is likely to be a result of that.
The higher the acidity, the lower the pH. Surprisingly, higher volatile acidity has a weak correlation with higher pH. Maybe volatile acids are “escaping” from the wine?
Fixed and volatile acidity do not have much to do with each other, but citric acid has to do with both of them! Citric acid has positive correlation with fixed acidity and negative correlation with volatile acidity. These relationships look somewhat nonlinear.
Looks like fixed acidity is linearly related to citric acid to some power, perhaps 4.
Here the relationship looks most linear when citric acid values are squared, but the approximation is very rough.
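One rough way to compare candidate powers is to correlate the response with the predictor raised to each power and see which transformation gives the most linear relationship. The sketch below does this on simulated data, not on the wine data: the variables `x` and `y` and the quadratic relationship between them are assumptions made purely for illustration.

```r
set.seed(3)

# Simulated data (NOT the wine data): the response depends on the square of x
x <- runif(500, 0, 1)
y <- 2 * x^2 + rnorm(500, sd = 0.05)

# Pearson correlation between y and x raised to each candidate power
powers <- 1:4
cors <- sapply(powers, function(p) cor(y, x^p))
names(cors) <- paste0("x^", powers)
print(round(cors, 3))

# Power with the strongest linear association
best_power <- powers[which.max(abs(cors))]
print(best_power)
```

With real data the differences between neighbouring powers are often small, so such a comparison only supports a rough statement like the ones above.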
Merging quality categories to just high (7 or 8) and low (6 or below) turned out to be helpful in clarifying the differences between wines. In summary, high quality wines tend to have relatively:
In addition to relationships between physical measures and quality I investigated the composition of acidity in more detail, because two acidity measures correlated with quality, and they interact with each other. Interestingly volatile acidity has negative correlation and fixed acidity has positive correlation with citric acid, but volatile and fixed acidity do not correlate much with each other. The relationships seem to be linear in some power of citric acid, perhaps around 2 (volatile acidity) and 4 (fixed acidity). I also looked at the composition of density, which is likely a result of other measured physical properties.
The strongest correlation I found was between total sulfur dioxide and fixed sulfur dioxide, but the correlation is a result of how the variable was created. After that fixed acidity and pH have the highest correlation (-0.68). Other similarly strong correlations include:
However, the most interesting correlation is between alcohol and quality (0.48). High quality wines tend to have a lot of alcohol.
## wine$quality.bin: Low
## fixed.acidity volatile.acidity
## fixed.acidity 1.0000000 -0.2313619
## volatile.acidity -0.2313619 1.0000000
## --------------------------------------------------------
## wine$quality.bin: High
## fixed.acidity volatile.acidity
## fixed.acidity 1.0000000 -0.2651239
## volatile.acidity -0.2651239 1.0000000
## wine$quality.bin: Low
## volatile.acidity citric.acid
## volatile.acidity 1.0000000 -0.5313932
## citric.acid -0.5313932 1.0000000
## --------------------------------------------------------
## wine$quality.bin: High
## volatile.acidity citric.acid
## volatile.acidity 1.000000 -0.494798
## citric.acid -0.494798 1.000000
## wine$quality.bin: Low
## fixed.acidity citric.acid
## fixed.acidity 1.0000000 0.6522584
## citric.acid 0.6522584 1.0000000
## --------------------------------------------------------
## wine$quality.bin: High
## fixed.acidity citric.acid
## fixed.acidity 1.0000000 0.7452792
## citric.acid 0.7452792 1.0000000
Plotting the wines based on their acidity measures reveals two clusters of high-quality wines:
Although there is overlap, many low quality wines could already be identified from this plot: wines presented with red dots above the black line and grey dots below it are likely to be of poor quality (the location of the line is approximate and only for illustration). Interestingly, the correlations between the acidity measures do not change markedly between the quality classes. The effect is probably an interaction between features.
## wine$quality.bin: Low
## acidity.inter citric.acid
## acidity.inter 1.00000000 -0.03304717
## citric.acid -0.03304717 1.00000000
## --------------------------------------------------------
## wine$quality.bin: High
## acidity.inter citric.acid
## acidity.inter 1.0000000 -0.1986388
## citric.acid -0.1986388 1.0000000
Low quality wines have almost no correlation (-0.03) between citric.acid and the new interaction term, while in high quality wines there is a modest negative correlation (-0.20).
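The by-class correlation tables above can be produced along the following lines. This is a sketch on simulated stand-in data (`wine_demo` is hypothetical and its values are random), so the correlations it prints will not match the report. The interaction term is computed as volatile acidity times fixed acidity, as described in the final summary.

```r
# Simulated stand-in for the wine data (values are random, NOT real)
set.seed(1)
n <- 200
wine_demo <- data.frame(
  fixed.acidity    = runif(n, 5, 12),
  volatile.acidity = runif(n, 0.2, 1.0),
  citric.acid      = runif(n, 0, 0.8),
  quality.bin      = factor(sample(c("Low", "High"), n, replace = TRUE),
                            levels = c("Low", "High"))
)

# Interaction term: volatile acidity times fixed acidity
wine_demo$acidity.inter <-
  wine_demo$volatile.acidity * wine_demo$fixed.acidity

# Correlation matrix of the two variables within each quality class
cors <- by(wine_demo[, c("acidity.inter", "citric.acid")],
           wine_demo$quality.bin, cor)
print(cors)
```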
## Warning in loop_apply(n, do.ply): Removed 4 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 2 rows containing missing values
## (geom_point).
## wine$quality.bin: Low
## sulphates fixed.sulfur.dioxide
## sulphates 1.00000000 0.07797316
## fixed.sulfur.dioxide 0.07797316 1.00000000
## --------------------------------------------------------
## wine$quality.bin: High
## sulphates fixed.sulfur.dioxide
## sulphates 1.00000000 -0.06161471
## fixed.sulfur.dioxide -0.06161471 1.00000000
## wine$quality.bin: Low
## alcohol fixed.sulfur.dioxide
## alcohol 1.0000000 -0.2388809
## fixed.sulfur.dioxide -0.2388809 1.0000000
## --------------------------------------------------------
## wine$quality.bin: High
## alcohol fixed.sulfur.dioxide
## alcohol 1.0000000 0.1620387
## fixed.sulfur.dioxide 0.1620387 1.0000000
## wine$quality.bin: Low
## sulphates alcohol
## sulphates 1.00000000 0.02220524
## alcohol 0.02220524 1.00000000
## --------------------------------------------------------
## wine$quality.bin: High
## sulphates alcohol
## sulphates 1.00000000 -0.05229298
## alcohol -0.05229298 1.00000000
Good quality wines form a rather tight cluster. Again it is possible to identify many poor quality wines visually: any grey wine and all wines above the black line are likely to be of poor quality. This time the correlations between features differ between the quality classes. For instance, in low quality wines fixed sulfur dioxide is negatively correlated with alcohol (-0.24), but in high quality wines the relationship is reversed (0.16). The correlations between sulphates and fixed sulfur dioxide and between sulphates and alcohol show similar sign changes, but to a lesser extent. Classification algorithms could probably do a good job at identifying high-quality wines (quality 7 or 8) using the following features: fixed acidity, volatile acidity, citric acid, sulphates, fixed sulfur dioxide and alcohol.
Next, three classification models are trained on the promising features and their performance is assessed. The goal is to achieve a proof of concept for identifying high quality wines based on the chemical properties. Therefore the models are used ‘out of the box’. Model parameters are not optimized and outliers are not removed from the data. The purpose of the models is just to quickly validate the hypothesis that the identified promising features are useful for predicting wine quality.
First the data is randomly split into training and test sets. 70 % of the data is used for training the models and the rest is held out for evaluating their accuracy. The classification models used are k-nearest neighbors (KNN), support vector machine (SVM) and random forest. They are all well-known algorithms that usually perform well, and they use different approaches to modeling.
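A random 70/30 split can be sketched in base R as follows. The seed and the stand-in data frame `wine_demo` are assumptions (the report does not state how the split was drawn); only the row count of 1599 comes from the dataset.

```r
set.seed(42)  # hypothetical seed; the report does not state one

# Stand-in data frame with the same number of rows as the wine data
wine_demo <- data.frame(id = seq_len(1599))

# Draw 70 % of the row indices at random for training
train_idx <- sample(nrow(wine_demo), size = floor(0.7 * nrow(wine_demo)))
train <- wine_demo[train_idx, , drop = FALSE]
test  <- wine_demo[-train_idx, , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```

With 1599 wines this gives 1119 training rows and 480 test rows, which agrees with the 480 wines counted in each confusion matrix below.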
The following parameters are used for the models.
KNN: k = 3. This is a commonly used value for the number of neighbors to consider in prediction.
SVM: scale = TRUE, kernel = ‘radial’, gamma = 1/6 (1/(data dimension)), cost (C) = 1. These parameters are the defaults in support vector machine implementation in e1071 package. Features are normalized, and a radial kernel is used.
Random Forest: ntree = 500, mtry = floor(sqrt(6)) = 2, replace = TRUE, cutoff = 1/2, nodesize = 1. These parameters are the classification defaults for randomForest in the randomForest package. 500 trees are grown to the maximum size, where the minimum number of observations in a terminal node is 1.
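The three models with the parameters listed above could be fitted roughly as in the sketch below. It assumes the `class`, `e1071` and `randomForest` packages are installed, and it runs on simulated data: the feature matrix `x`, the labels `y` and the seed are made up for illustration, so the resulting table has nothing to do with the wine results.

```r
library(class)         # knn()
library(e1071)         # svm()
library(randomForest)  # randomForest()

# Simulated stand-in data (NOT the wine data): six numeric features
# and a two-level label
set.seed(7)
n <- 300
x <- matrix(rnorm(n * 6), ncol = 6,
            dimnames = list(NULL, paste0("V", 1:6)))
y <- factor(ifelse(x[, "V1"] + x[, "V2"] + rnorm(n) > 1.5, "High", "Low"),
            levels = c("Low", "High"))

train_idx <- sample(n, size = floor(0.7 * n))
x_train <- x[train_idx, ];  y_train <- y[train_idx]
x_test  <- x[-train_idx, ]; y_test  <- y[-train_idx]

# KNN with k = 3
prediction_1 <- knn(x_train, x_test, cl = y_train, k = 3)

# SVM with the listed defaults written out explicitly
svm_fit <- svm(x_train, y_train, scale = TRUE, kernel = "radial",
               gamma = 1 / ncol(x_train), cost = 1)
prediction_2 <- predict(svm_fit, x_test)

# Random forest with the listed defaults written out explicitly
rf_fit <- randomForest(x_train, y_train, ntree = 500, nodesize = 1)
prediction_3 <- predict(rf_fit, x_test)

# Confusion matrix: rows are predictions, columns are the true classes
print(table(prediction_3, y_test))
```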
## K nearest neighbors predictions:
##
## prediction_1 Low High
## Low 391 40
## High 26 23
## [1] 0.86
## Support vector machine predictions:
##
## prediction_2 Low High
## Low 408 50
## High 9 13
## [1] 0.88
## Random forest predictions:
##
## prediction_3 Low High
## Low 405 32
## High 12 31
## [1] 0.91
Indeed, k-nearest neighbors, support vector machine and random forest all work pretty well even without any optimization. In this case random forest has the best performance, achieving 91 % classification accuracy on the test set. Reading the rows of the tables above as predictions and the columns as actual classes (the column totals are the same for every model), the precision of identifying high quality wines is 0.72 (31 of 43) and the recall is 0.49 (31 of 63). In practical terms, the random forest model is correct about 72 % of the time when it predicts a wine is good, and about 93 % of the time when it predicts a wine is not good, but it finds only about half of the good wines. Because the baseline chance of a wine being of high quality in the test set is only 13 %, the result is a significant improvement.
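The metrics can be recomputed from the random forest table printed above. The sketch assumes that rows are predicted classes and columns are actual classes, which fits the fact that the column totals (417 and 63) are identical for all three models. The object names are hypothetical.

```r
# Random forest confusion matrix copied from the output above.
# Assumption: rows = predicted class, columns = actual class.
cm <- matrix(c(405, 12, 32, 31), nrow = 2,
             dimnames = list(predicted = c("Low", "High"),
                             actual    = c("Low", "High")))

accuracy  <- sum(diag(cm)) / sum(cm)                 # all correct / all wines
precision <- cm["High", "High"] / sum(cm["High", ])  # correct among predicted High
recall    <- cm["High", "High"] / sum(cm[, "High"])  # found among actual High
baseline  <- sum(cm[, "High"]) / sum(cm)             # share of actual High wines

metrics <- round(c(accuracy = accuracy, precision = precision,
                   recall = recall, baseline = baseline), 2)
print(metrics)
```

Under this reading the values are 0.91 accuracy, 0.72 precision, 0.49 recall and a 0.13 baseline share of high quality wines.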
The combination of different acidity measures (fixed acidity, volatile acidity and citric acid) turned out to be useful in visually differentiating between high and low quality wines, as did the combination of fixed sulfur dioxide, sulphates and alcohol. The clustering looked much tighter than I had expected based on the bivariate comparisons. After seeing these plots it was not a surprise that classification models performed well at predicting wine quality.
The biggest surprise was the interaction between sulphates and fixed sulfur dioxide. Neither of them was a strong candidate as a predictor, but together they collected high-quality wines in a tight cluster.
I tried three classification models on the promising features identified during the exploratory analysis. All of them worked well “out of the box”, achieving around 90 % accuracy. In this case random forest had the best performance with 91 % accuracy on the test set. This means that based on six physical measures of the wine, the random forest model can correctly predict nine times out of ten whether the wine is of high quality. The performance of the models could likely be improved further with a little optimization. For instance, the k-value in the k-nearest neighbors model was pulled from a hat, and the other models were fitted with default parameters.
This report set out to explore the red wine dataset originally collected by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis (Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236). The goal was to find out whether it would be possible to identify high quality wines, as defined by subjective expert evaluations, just based on a few of their measurable chemical properties. The dataset consists of 11 physical measurements of 1599 Portuguese vinho verde red wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## [1] 0.8075694
##
## Low (3-6) High (7-8)
## 1382 217
Wine quality is measured on a discrete scale from 0 (very bad) to 10 (very excellent). The distribution of quality scores for the wines is bell-shaped, with median 6 and standard deviation 0.81. Extreme scores are not used and in practice wine quality varies between 3 and 8. As many of the categories have relatively few observations, and the main interest is in differentiating good wines from the rest, it makes sense to combine categories. Only a minority of wines - 217 out of 1599, or 13.6 % - are of high quality.
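Combining the score categories into two classes can be sketched with `cut()`. The example scores below are illustrative only and are not rows from the dataset; the class counts used for the share are the ones reported in the table above.

```r
# A few example scores (illustrative only, not rows from the dataset)
quality <- c(3, 5, 6, 7, 8, 6, 5, 7)

# Scores up to 6 become "Low (3-6)", scores above 6 become "High (7-8)"
quality.bin <- cut(quality, breaks = c(-Inf, 6, Inf),
                   labels = c("Low (3-6)", "High (7-8)"))
print(table(quality.bin))

# Share of high quality wines, using the counts reported above
high_share <- round(100 * 217 / (1382 + 217), 1)
print(high_share)
```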
## wine$quality.bin: Low (3-6)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## wine$quality.bin: High (7-8)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
## [1] 0.4761663
The amount of alcohol has the strongest association with wine quality. Better wines tend to have more alcohol, 11.5 % on average, in contrast to 10.25 % in low quality wines. The correlation between wine quality (original 10-point scale) and amount of alcohol is 0.48. Wines with more than 11 % alcohol are considerably more likely to be of high quality than wines below that level.
## wine$quality.bin: Low (3-6)
## fixed.acidity volatile.acidity citric.acid
## Min. : 4.600 Min. :0.160 Min. :0.0000
## 1st Qu.: 7.100 1st Qu.:0.420 1st Qu.:0.0825
## Median : 7.800 Median :0.540 Median :0.2400
## Mean : 8.237 Mean :0.547 Mean :0.2544
## 3rd Qu.: 9.100 3rd Qu.:0.650 3rd Qu.:0.4000
## Max. :15.900 Max. :1.580 Max. :1.0000
## --------------------------------------------------------
## wine$quality.bin: High (7-8)
## fixed.acidity volatile.acidity citric.acid
## Min. : 4.900 Min. :0.1200 Min. :0.0000
## 1st Qu.: 7.400 1st Qu.:0.3000 1st Qu.:0.3000
## Median : 8.700 Median :0.3700 Median :0.4000
## Mean : 8.847 Mean :0.4055 Mean :0.3765
## 3rd Qu.:10.100 3rd Qu.:0.4900 3rd Qu.:0.4900
## Max. :15.600 Max. :0.9150 Max. :0.7600
## wine$quality.bin: Low (3-6)
## acidity.inter citric.acid
## acidity.inter 1.00000000 -0.03304717
## citric.acid -0.03304717 1.00000000
## --------------------------------------------------------
## wine$quality.bin: High (7-8)
## acidity.inter citric.acid
## acidity.inter 1.0000000 -0.1986388
## citric.acid -0.1986388 1.0000000
Three acidity measures, fixed acidity (tartaric acid), volatile acidity (acetic acid) and citric acid, differentiate high-quality wines from low-quality wines rather well, although the relationship is not straightforward. On average high quality wines tend to have higher fixed acidity (8.8 vs. 8.2 g/l) and citric acid (0.38 vs. 0.25 g/l), and lower volatile acidity (0.41 vs. 0.55 g/l) than low quality wines. However, the largest differences come up when interactions between different acidity measures are taken into account. Low quality wines show almost no correlation (-0.03) between the interaction term of volatile and fixed acidity (volatile acidity times fixed acidity) and citric acid, which means that low quality wines can have a wide range of citric acid levels regardless of the interaction of the two other measures. With high quality wines this picture changes. The amount of citric acid tends to decrease in high quality wines as the value of the interaction term between volatile and fixed acidity increases. The correlation is modest (-0.20). As a result of these interactions, it is possible to identify clusters of high and low quality wines even visually, as the above plot demonstrates. Dashed lines help illustrate the borders of distinct clusters. In the plot, wines represented by red dots above the dashed line and by grey dots below it are likely to have low quality. The situation is similar when interactions between amounts of sulphates, fixed sulfur dioxide and alcohol are investigated.
Finally, the usability of the promising features (fixed acidity, volatile acidity, citric acid, sulphates, fixed sulfur dioxide and alcohol) for identification of high quality wines was tested by trying to predict the wine quality using k-nearest neighbors, support vector machine and random forest algorithms. The purpose of these models was to act as a proof of concept, so parameter optimization and outlier removal were skipped. Still, the best performing algorithm, random forest, was able to achieve 91 % overall classification accuracy, and 72 % precision and 49 % recall on identifying high quality wines on the test set (reading the rows of the confusion matrix as predictions and the columns as actual classes).
The dataset I explored contained physical measurements of 1599 red wines, along with subjective quality scores. I started by investigating the distributions of individual variables. After that I identified promising correlations between the variables in an effort to find a set of features that could be used to predict wine quality. Instead of the full quality scale I was only interested in differentiating good wines (score 7 or 8) from the rest. Initially the dataset felt confusing and it didn’t look like there were any interesting patterns, but systematically plotting comparisons of variables slowly revealed many interesting relationships. During the analysis the largest surprise was that some of the variables that did not look like very good predictors alone worked very well when combined. In the end I used six identified features to build three classification models. Random forest performed best, achieving 91 % classification accuracy on the test set. The performance of the models could be improved by optimization and possibly by adding new features. Some of the less promising features could still contain useful information.
I made a couple of detours during the analysis by investigating the relationships of density and pH with other variables they had high correlations with, and by looking into the interactions of different acidity measures in detail. In the end I did not get anything useful out of this exploration, except perhaps the decision to leave density and pH out of further analysis, because the information they contain seemed likely to be already accounted for by other variables.